大邓和他的Python (Da Deng and His Python)

Extracting structured data with extruct

Original 2018-01-20

Author: 大邓 (Da Deng)

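The extruct library

The extruct library extracts metadata embedded in HTML markup. It currently supports the following formats:

W3C HTML Microdata
JSON-LD data embedded in HTML

First, let's see what Microdata and JSON-LD data look like: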

Microdata

<div itemprop="aggregateRating" itemscope="" itemtype="http://schema.org/AggregateRating">
    
    <meta itemprop="worstRating" content="1">
    
    <meta itemprop="bestRating" content="5">
    
    <div class="bbystars-small-yellow">
        
        <div class="fill" style="width: 88%"></div>
    
    </div>
    
    <span itemprop="ratingValue" aria-label="4.4 out of 5 stars">4.4</span>
    
    <meta itemprop="reviewCount" content="305733">

</div>

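At first glance this looks no different from ordinary HTML, and at first I didn't think it was anything special either. Look closer, though, and the attributes itemprop, itemscope and itemtype stand out: itemscope marks the scope of an item, itemtype names its type, and itemprop names a property of that item. They appear because this HTML is marked up with the Schema.org structured-data vocabulary, and sites that mark up their data this way are easier for search engines to crawl and analyze. For details, see the Schema.org website: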

https://schema.org/AggregateRating

For example, worstRating is the lowest value on the rating scale:

<meta itemprop="worstRating" content="1">

and bestRating is the highest:

<meta itemprop="bestRating" content="5">
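To make the role of these attributes more concrete, here is a minimal sketch that reads them with nothing but Python's standard html.parser module. It is not how extruct works internally; the class name ItemPropCollector and the shortened SAMPLE string are made up for illustration, and the sketch assumes a single, non-nested item.

```python
from html.parser import HTMLParser

# Hypothetical sample, shortened from the Microdata snippet above.
SAMPLE = """
<div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
  <meta itemprop="worstRating" content="1">
  <meta itemprop="bestRating" content="5">
  <span itemprop="ratingValue" aria-label="4.4 out of 5 stars">4.4</span>
  <meta itemprop="reviewCount" content="305733">
</div>
"""


class ItemPropCollector(HTMLParser):
    """Collects itemprop values from a single, non-nested Microdata item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._pending = None  # itemprop whose text value has not been seen yet

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # itemscope opens the item; itemtype names its schema.org type
            self.item_type = attrs.get("itemtype")
            return
        name = attrs.get("itemprop")
        if name is None:
            return
        if "content" in attrs:
            # <meta itemprop=... content=...> carries the value in an attribute
            self.properties[name] = attrs["content"]
        else:
            # otherwise the value is the element's text, read in handle_data
            self._pending = name

    def handle_data(self, data):
        if self._pending and data.strip():
            self.properties[self._pending] = data.strip()
            self._pending = None


collector = ItemPropCollector()
collector.feed(SAMPLE)
print(collector.item_type)
print(collector.properties)
```

Real pages nest items and use more value-bearing attributes (href, src, datetime and so on), which is why a dedicated library is the more practical choice.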

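Extracting Microdata

Structured data marked up as Microdata like this can be pulled out of the HTML with extruct, which returns it as JSON-style Python data (dicts and lists).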

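from extruct.w3cmicrodata import MicrodataExtractor

# html_content is a string holding the Microdata HTML shown above
mde = MicrodataExtractor()
data = mde.extract(html_content)
print(data)

# The extruct release used here returns {'items': [...]};
# newer releases may return the list of items directly.
print(data['items'][0]['properties']['ratingValue'])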

{
  'items': [
    {
      'type': 'http://schema.org/AggregateRating',
      'properties': {
        'reviewCount': '305733',
        'bestRating': '5',
        'ratingValue': '4.4',
        'worstRating': '1'
      }
    }
  ]
}

4.4
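Note that every extracted property is a string. As a small follow-up sketch, the values can be converted to numbers and the rating rescaled to the 0 to 1 range using worstRating and bestRating. The helper name normalized_rating is hypothetical, and the properties dict is copied by hand from the output above instead of being produced by extruct.

```python
def normalized_rating(properties):
    """Rescale ratingValue to 0..1 using the worstRating/bestRating bounds."""
    worst = float(properties["worstRating"])
    best = float(properties["bestRating"])
    value = float(properties["ratingValue"])
    if best == worst:
        raise ValueError("bestRating and worstRating must differ")
    return (value - worst) / (best - worst)


# Copied from the extracted output shown above.
properties = {
    "reviewCount": "305733",
    "bestRating": "5",
    "ratingValue": "4.4",
    "worstRating": "1",
}

score = normalized_rating(properties)
review_count = int(properties["reviewCount"])
print(round(score, 2), review_count)  # 0.85 305733
```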

JSON-LD

html = """
<html>
    <head>
        <title>Some Person Page</title>
    </head>
    <body>
        <h1>This guys</h1>
        <script type="application/ld+json">
        {
        "@context": "http://schema.org",
        "@type": "Person",
        "name": "John Doe",
        "jobTitle": "Graduate research assistant",
        "affiliation": "University of Dreams",
        "additionalName": "Johnny",
        "url": "http://www.example.com",
        "address": {
            "@type": "PostalAddress",
            "streetAddress": "1234 Peach Drive",
            "addressLocality": "Wonderland",
            "addressRegion": "Georgia"
                }
        }        
        </script>
    </body>
</html>"""

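This one is easier to recognize: a standard JSON string is embedded in the page source. When I run into this, I usually pull it out with a regular expression. The HTML contains the tag

<script type="application/ld+json">

which marks its content as JSON-LD, and extruct can extract the JSON data from it with little effort.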
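For comparison, the regular-expression approach mentioned above might look roughly like the sketch below. The pattern, the function name extract_jsonld_with_regex, and the shortened sample_html are assumptions for illustration. The approach is brittle (it expects well-formed JSON and a plainly written script tag), which is a good reason to prefer extruct.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> and captures the body.
JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_jsonld_with_regex(html):
    """Return a list with one parsed object per JSON-LD script block."""
    return [json.loads(block) for block in JSONLD_RE.findall(html)]


# Hypothetical sample, shortened from the page above.
sample_html = """
<html><body>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Person",
  "name": "John Doe"
}
</script>
</body></html>
"""

items = extract_jsonld_with_regex(sample_html)
print(items[0]["name"])  # John Doe
```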

Extracting JSON-LD data

from extruct.jsonld import JsonLdExtractor
jslde = JsonLdExtractor()

data = jslde.extract(html)

print(data)

[{'@context': 'http://schema.org',
  '@type': 'Person',
  'additionalName': 'Johnny',
  'address': {'@type': 'PostalAddress',
              'addressLocality': 'Wonderland',
              'addressRegion': 'Georgia',
              'streetAddress': '1234 Peach Drive'},
  'affiliation': 'University of Dreams',
  'jobTitle': 'Graduate research assistant',
  'name': 'John Doe',
  'url': 'http://www.example.com'}]
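Because the result is an ordinary list of nested dicts, it can be walked with plain Python. The sketch below uses a hypothetical helper, find_by_type, to collect every object with a given @type, such as the nested PostalAddress. The data literal is copied from the output above, and the helper assumes @type is a single string, which JSON-LD does not guarantee in general.

```python
def find_by_type(node, wanted_type):
    """Yield every dict in a JSON-LD structure whose @type equals wanted_type."""
    if isinstance(node, dict):
        if node.get("@type") == wanted_type:
            yield node
        for value in node.values():
            yield from find_by_type(value, wanted_type)
    elif isinstance(node, list):
        for element in node:
            yield from find_by_type(element, wanted_type)


# Copied from the extracted output shown above.
data = [{
    "@context": "http://schema.org",
    "@type": "Person",
    "additionalName": "Johnny",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Wonderland",
        "addressRegion": "Georgia",
        "streetAddress": "1234 Peach Drive",
    },
    "affiliation": "University of Dreams",
    "jobTitle": "Graduate research assistant",
    "name": "John Doe",
    "url": "http://www.example.com",
}]

addresses = list(find_by_type(data, "PostalAddress"))
print(addresses[0]["addressLocality"])  # Wonderland
```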